ABSTRACT

The relative merits of two alternative procedures for allocating
items to subtests in multiple matrix sampling, and the feasibility of
using the jackknife in approximating standard errors of estimate, were
investigated empirically through post mortem item-examinee sampling.
The results indicate clearly that a partially balanced incomplete block
design is preferable to random sampling in allocating items to subtests.
The jackknife was found to better approximate standard errors of
estimate in the latter item allocation procedure than in the former.
These and other results are discussed in detail. (Author)



A NOTE ON ALLOCATING ITEMS TO SUBTESTS IN MULTIPLE MATRIX SAMPLING AND
APPROXIMATING STANDARD ERRORS OF ESTIMATE WITH THE JACKKNIFE 



DAVID M. SHOEMAKER 

Southwest Regional Laboratory for Educational 
Research and Development 

Multiple matrix sampling or, more popularly, item-examinee sampling,
is a procedure in which a set of K test items is subdivided randomly
into t subtests containing k items each, with each subtest administered
to n examinees selected randomly from the population of N examinees.
Although each examinee receives only a proportion of the K test items,
the equations given by Hooke (1956) and Lord (1960) permit the researcher
to estimate parameters of the test score distribution which would have
been obtained by testing all N examinees over all K test items. Because
numerous combinations of t, k, and n are feasible in any investigation,
the researcher must come to grips with several questions about how the
procedure should be implemented. "How should items be allocated to
subtests?" is one important question requiring an answer and is the one
addressed specifically herein; concomitantly, the feasibility of using
the jackknife procedure for approximating standard errors of estimate
in multiple matrix sampling is considered in some detail.

A basic requirement in multiple matrix sampling is that k items
from the K-item population are allocated randomly to each subtest.
However, in constructing the t subtests, four general item allocation
procedures are possible, each of which is described more appropriately
as restricted random sampling. The four procedures and concomitant
restrictions are listed in Table 1 and an example of each procedure is
given in Table 2 for k = 3 and K = 7.



Please insert Tables 1 and 2 about here. 



Procedures 1, 2 and 3 are implemented easily in practice; Procedure
4, however, is more difficult and the degree of difficulty increases
with increases in K. Within the context of the design of experiments,
Procedures 3 and 4 are referred to, respectively, as a "partially
balanced incomplete block" design (PBIB) and a "balanced incomplete
block" design (BIB). That which is "partially balanced" or "balanced"
by each design is the item pairings. In the BIB design, all possible
item pairings occur among subtests and they occur with equal frequency;
in the PBIB design, item pairings do not occur with equal frequency and,
indeed, some item pairs may be excluded completely. A BIB design is
often difficult to implement because, for a given K, no design may
exist or, if there is a design, the number of subtests required is
excessively large. This limitation is most serious when K exceeds 50,
even permitting minor adjustments in K to fit an available design. For
example, when K = 91 and k = 10, 91 subtests would be required; for
K = 97 and k = 10, 4656; and, for K = 199 and k = 10, 19701. The first
of these three BIB designs is cited and illustrated by Cochran and Cox
(1957); the other two are given by Ramanujacharyulu (1966) and cited by
Knapp (1968a). Although BIB designs have been used on a few occasions
(e.g., Knapp, 1968a, 1968b) when K was small (i.e., 43, 29 and 13 with
Knapp), such designs are ill-suited to large item populations. This
point is of no minor import because one of the major reasons for using
multiple matrix sampling is its potential for dealing with large item
populations. Because of this, it is expected that the majority of item
allocation procedures in multiple matrix sampling will involve Procedure
1, 2 or 3.
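
The three subtest counts quoted for k = 10 can be checked against the
usual counting identities for a balanced incomplete block design,
tk = rK and r(k - 1) = $\lambda$(K - 1), where r is the number of subtests
containing a given item and $\lambda$ is the number containing a given item
pair. The following is a minimal sketch in Python; the function name
bib_parameters is hypothetical, and the check covers only these
necessary conditions. It neither constructs a design nor shows that one
exists.

```python
def bib_parameters(K, k, t):
    """Return (r, lam) implied by K items, k items per subtest, t subtests.

    r   = number of subtests in which each item appears (t*k = r*K)
    lam = number of subtests in which each item pair appears,
          from r*(k - 1) = lam*(K - 1)
    Raises ValueError when either quantity is not a whole number, in
    which case no balanced design with these sizes can exist.
    """
    if (t * k) % K != 0:
        raise ValueError("t*k is not a multiple of K")
    r = t * k // K
    if (r * (k - 1)) % (K - 1) != 0:
        raise ValueError("pair frequency is not a whole number")
    lam = r * (k - 1) // (K - 1)
    return r, lam


# The three designs mentioned in the text, all with k = 10.
for K, t in [(91, 91), (97, 4656), (199, 19701)]:
    print(K, t, bib_parameters(K, 10, t))
```

Under these identities the first design pairs each two items once,
whereas the other two pair each two items 45 times, which is one way of
seeing why the required number of subtests grows so quickly.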
It should be noted that, in practice, Procedures 1, 2 and 3 are
implemented typically in conjunction with item stratification, that is,
a stratified-random sampling procedure is used with the stratification
being on item content, item difficulty level or both item content and
item difficulty level. The relative merits of such stratification
procedures have been discussed previously (i.e., Shoemaker and Osburn,
1968; Kleinke, 1971) and are not considered here.

Of principal interest in this investigation were the relative
merits of Procedures 1 and 3. Procedure 2 was excluded because it is
used rarely in practice. The metric by which these two item allocation
procedures were contrasted was the standard error of estimate.
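
To make the contrast between Procedures 1 and 3 concrete, the sketch
below generates subtests under each. It is an illustration under stated
assumptions, not the allocation routine used in this investigation: the
function names are hypothetical, and the equal-frequency allocation
simply cuts a shuffled item list, repeated r times, into consecutive
blocks of k items, which is one easy way to satisfy the Procedure 3
restrictions.

```python
import random
from collections import Counter


def allocate_random(K, k, t, rng):
    """Procedure 1: each subtest is an independent random sample of k items."""
    return [sorted(rng.sample(range(1, K + 1), k)) for _ in range(t)]


def allocate_equal_frequency(K, k, t, rng):
    """Procedure 3 (sketch): every item appears r = t*k/K times overall."""
    if k > K or (t * k) % K != 0:
        raise ValueError("need k <= K and t*k an integer multiple of K")
    r = t * k // K
    order = list(range(1, K + 1))
    rng.shuffle(order)
    sequence = order * r  # periodic with period K, so any k <= K
    # consecutive entries are distinct items
    return [sequence[j * k:(j + 1) * k] for j in range(t)]


rng = random.Random(0)  # fixed seed only so the example is repeatable
for name, subtests in [
    ("random", allocate_random(7, 3, 7, rng)),
    ("equal frequency", allocate_equal_frequency(7, 3, 7, rng)),
]:
    counts = Counter(item for subtest in subtests for item in subtest)
    omitted = [item for item in range(1, 8) if counts[item] == 0]
    print(name, subtests, "omitted items:", omitted)
```

With random allocation the list of omitted items may or may not be
empty; with the equal-frequency allocation it is always empty.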

METHOD 

The research design was one of post mortem item-examinee sampling
with the required data bases generated through a computer simulation
model described previously by Shoemaker (1971). In post mortem item-
examinee sampling, various samples of items and examinees are selected
randomly from a data base (an item by examinee matrix) and used to
estimate parameters of the base from which they have been sampled. The
researcher acts as if only certain examinees have been tested over
certain items, knowing all the while the results obtained by testing all
examinees over all items.
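
The logic of post mortem sampling can be sketched in a few lines of
Python. Everything specific here is an assumption made for
illustration: the function name, the toy data base of independent
dichotomous items (not the simulation model of Shoemaker, 1971), and
the estimator of the mean test score, taken as K times the pooled
proportion of correct responses, which is appropriate only when every
subtest has the same k and n.

```python
import random


def post_mortem_mean(matrix, k, t, n, rng):
    """Estimate the mean total score from t subtests of k items each.

    matrix is a complete examinee-by-item table of 0/1 scores. Each
    subtest gets its own n examinees (no examinee is used twice) and an
    independent random sample of k items (random allocation).
    """
    N, K = len(matrix), len(matrix[0])
    if t * n > N or k > K:
        raise ValueError("need t*n <= N and k <= K")
    pool = rng.sample(range(N), t * n)
    correct = 0
    for j in range(t):
        items = rng.sample(range(K), k)
        examinees = pool[j * n:(j + 1) * n]
        correct += sum(matrix[e][i] for e in examinees for i in items)
    return K * correct / (t * n * k)


rng = random.Random(1)
# Hypothetical data base: 500 examinees, 40 items of varying difficulty.
difficulties = [0.3 + 0.4 * i / 39 for i in range(40)]
matrix = [[1 if rng.random() < p else 0 for p in difficulties]
          for _ in range(500)]
true_mean = sum(sum(row) for row in matrix) / len(matrix)
estimate = post_mortem_mean(matrix, k=10, t=4, n=50, rng=rng)
print("known mean:", true_mean, "estimate from the sample:", estimate)
```

Because the full matrix is known, the estimate can be compared directly
with the value it is meant to recover, which is the point of the post
mortem design.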

Parameters of the data base manipulated systematically were: (a) the
number of test items (K = 40, 60), (b) variance of the item difficulty
indices ($\sigma_p^2$ = .00, .05), (c) reliability of total test scores
($\alpha$ = .80, .90), and (d) degree of skewness in the normative
distribution (distributed normally, markedly negatively skewed). When
the distribution of test scores was negatively skewed, only
$\sigma_p^2$ = .00 was used. The selection of parameters was not
unrelated to that encountered frequently in practice. It is well-known
that when items are scored dichotomously the variance of the item
difficulty indices for most standardized achievement tests (whose test
scores are frequently distributed approximately normally) ranges
typically from .04 to .08 and the corresponding value for markedly
skewed distributions of test scores (e.g., those resulting from
pretests, posttests, and "criterion-referenced" tests) is approximately
zero. The reliability coefficients selected are not unusual and span a
familiar range. The procedure used in this investigation to generate
data bases was costly and, for this reason, data bases having 40 and 60
items were generally used. However, to determine the degree of
generalizability of results obtained using these data bases, several
additional sampling plans were used on bases having 100 items (K = 100).

The nine item-examinee sampling plans used on data bases having 40
and 60 items are listed in Table 3. For several of these sampling plans,
the number of examinees per subtest was varied systematically (n = 10,
20, 30 and 40) to determine the degree of generalizability of results
obtained when n = 50 to other values of n. A PBIB design was used only
when $\sigma_p^2$ > 0 for a given data base. When $\sigma_p^2$ = 0, all
items are statistically parallel and Procedures 1 and 3 produce
equivalent results (and all differences observed between the two
procedures would be due to the sampling of examinees).

The parameters estimated were $\mu$ (the mean test score), $\mu_2$
through $\mu_4$ (the second through fourth central moments) and
$\sigma_p^2$. Estimating moments of the test score distribution is
important in multiple matrix sampling because they are the required
statistics in graduating the normative distribution, one of the major
objectives of multiple matrix sampling. The equations used to estimate
the moments of the test score distribution were those given by Lord
(1960); $\sigma_p^2$ was estimated through a components of variance
analysis. Each sampling plan was replicated 50 times.

The Jackknife Procedure

Of additional concern in this investigation was examining the
feasibility of a statistical procedure known as the "jackknife" in
approximating standard errors of estimate in multiple matrix sampling.
A good description of the jackknife is given by Mosteller and Tukey (1968)
and some preliminary results in applying the procedure to multiple matrix
sampling are given by Shoemaker (1972a). In general, the jackknife operates
on a data base which has been divided into subgroups of data and produces
a mean estimate of the parameter and approximates the standard error of
estimate associated with this statistic. The basic component of the
jackknife is the pseudovalue associated with each subgroup, which is the
weighted difference between the statistic computed on all the data and
the statistic computed on the body of data which remains after omitting
that subgroup. Because the pseudovalues behave as though they were
independent of each other, the standard error of the statistic is
computed according to the well-known formula for the standard error of
a sample mean. When the jackknife is applied to multiple matrix sampling
there are t subgroups of data but only one score (the estimated parameter)
for each subgroup, with that statistic weighted according to the number of
observations acquired by that subtest. The jackknife operates on the
statistics obtained from one set of t subtests and approximates the
variability of the pooled estimates which would have been observed over
repeated replications of the design.

The computations involved in the jackknife are relatively simple.

Let

t = the number of subgroups (subtests),
$y_{all}$ = the statistic computed on all the data, and
$y_{(j)}$ = the statistic computed on all the data left after
removing subgroup j.

The pseudovalues, $y_{*j}$, are then equal to

$$ y_{*j} = t\,y_{all} - (t - 1)\,y_{(j)} \quad \text{for } j = 1, 2, \ldots, t. $$

The jackknifed estimate of the parameter is equal to

$$ y_{*} = \frac{1}{t} \sum_{j=1}^{t} y_{*j} $$

with an estimate of its variance given by

$$ s_{*}^{2} = \frac{\sum_{j=1}^{t} \left( y_{*j} - y_{*} \right)^{2}}{t(t - 1)}. $$
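
The jackknife computations of this section can be sketched in Python as
follows. This is a generic illustration rather than the program used in
this investigation: the function names are hypothetical, the statistic
is passed in as an ordinary function of the list of subgroups, and the
unweighted pooled mean used in the example is only a stand-in for the
weighted pooled estimates of multiple matrix sampling.

```python
import math


def jackknife(groups, statistic):
    """Return (jackknifed estimate, jackknifed standard error).

    groups    : list of t subgroups of data (here, one entry per subtest)
    statistic : function mapping a list of subgroups to a number
    """
    t = len(groups)
    if t < 2:
        raise ValueError("the jackknife needs at least two subgroups")
    y_all = statistic(groups)
    pseudovalues = []
    for j in range(t):
        remaining = groups[:j] + groups[j + 1:]  # omit subgroup j
        pseudovalues.append(t * y_all - (t - 1) * statistic(remaining))
    estimate = sum(pseudovalues) / t
    variance = sum((p - estimate) ** 2 for p in pseudovalues) / (t * (t - 1))
    return estimate, math.sqrt(variance)


def pooled_mean(groups):
    """Toy statistic: unweighted mean of one score per subgroup."""
    return sum(groups) / len(groups)


estimate, standard_error = jackknife([1.0, 2.0, 3.0, 4.0], pooled_mean)
print(estimate, standard_error)
```

For a simple mean the pseudovalues reduce to the original scores, so the
jackknifed standard error coincides with the familiar standard error of
a sample mean; the procedure earns its keep with statistics, such as
higher central moments, for which no such simple formula is at hand.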

The procedure used in this investigation for testing the jackknife
was relatively straightforward. Because each sampling plan was replicated
r times, r estimates of each parameter were produced as well as r estimates
of the jackknifed standard error for each parameter. At the end of r
replications, two estimates of the standard error of estimate for each
parameter for each sampling plan were computed. The first estimate was
obtained by computing the standard deviation of the r estimates of each
parameter; the second, by computing the mean of the r jackknifed standard
errors for each parameter. The jackknife is justified to the degree that
the two standard errors agree.
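
The comparison just described amounts to three summary statistics per
parameter and sampling plan. A minimal sketch follows; the function
name, the dictionary keys and the toy numbers are hypothetical, and the
use of the n - 1 divisor in the standard deviations is an assumption,
since the divisor is not stated here.

```python
import statistics


def compare_standard_errors(estimates, jackknifed_ses):
    """Summarize r replications of one sampling plan for one parameter.

    estimates      : the r pooled estimates of the parameter
    jackknifed_ses : the r jackknifed standard errors, one per replication
    """
    return {
        # standard deviation of the r estimates (empirical standard error)
        "empirical_se": statistics.stdev(estimates),
        # mean of the r jackknifed standard errors
        "mean_jackknifed_se": statistics.mean(jackknifed_ses),
        # spread of the r jackknifed standard errors
        "sd_jackknifed_se": statistics.stdev(jackknifed_ses),
    }


# Toy numbers only, chosen so the arithmetic is easy to follow.
summary = compare_standard_errors([1.0, 2.0, 3.0], [0.5, 1.0, 1.5])
print(summary)
```

Agreement between the first two entries is the criterion by which the
jackknife is judged; the third indicates how stable a single jackknifed
standard error is from one replication to the next.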
RESULTS

The interrelations among standard errors obtained when $\alpha$ = .80 were
very similar to those obtained when $\alpha$ = .90 and, for this reason, only
those results obtained when $\alpha$ = .80 are reported in detail in Tables 3
and 4. The only difference observed between the two data sets was that,
result for result, the standard errors of estimate per item-examinee sampling
plan were generally larger for the higher reliability. This increase was
not unexpected and was consistent with previous results reported by Shoemaker
(1972b). Concomitantly and to conserve space, only results obtained for
$\mu$ and $\mu_2$ are tabulated. There is no loss of information here because
results similar to those for $\mu_2$ were obtained for $\mu_3$, $\mu_4$ and
$\sigma_p^2$. Although not reported in detail here, the results obtained
using data bases having 100 items (K = 100) and item-examinee sampling plans
involving examinee subgroups of size 10, 20, 30 and 40 suggest strongly that
the conclusions drawn here are generalizable to a variety of data bases and
to a variety of item-examinee sampling plans.



Please insert Tables 3 and 4 about here. 



The entries in Tables 3 and 4 are interpreted similarly and only
those for one sampling plan in Table 3 need be described in detail to
explain both tables. The first three entries in the first row of Table 3
give the parameters of the data base. In this case, the item population
consisted of 40 items, the variance of the item difficulty indices
(p = proportion answering the item correctly) was equal to 0 and the
test scores were distributed normally. Using a (t = 4 / k = 10 / n = 50)
item-examinee sampling plan with random allocation of items to subtests
(Procedure 1 in Table 1) and replicating the sampling plan 50 times, the
standard deviation of the 50 pooled estimates of the mean test score on
the 40-item test was equal to .4695. Fifty jackknifed estimates of the
standard error of the mean were produced. Their mean was equal to .4793;
their standard deviation, .2445. If the items for each subtest had been
allocated using a PBIB design (Procedure 3 in Table 1), corresponding
results would have appeared under 'PBIB' in the first row. None are
given there because $\sigma_p^2$ = 0 and the two item allocation
procedures are equivalent.

Looking at all results for SE(R), it was generally the case that, for
each sampling plan, the standard error of estimate was less when a PBIB
design was used. The relative magnitude of this discrepancy was greater
for the mean test score and decreased sharply for successively higher
central moments. Because several combinations of t and k (for a given
tk) occurred among sampling plans, it was possible to examine the effect
of certain combinations on the standard error of estimate. For a given
tk, an increase in t resulted in a decrease in SE(R) when estimating the
mean test score; for the second through fourth central moments, an
increase in k resulted in a decrease in SE(R); and, for $\sigma_p^2$, no
trend was discernible.



Regarding the jackknife, the results indicate that on the average
it approximated standard errors of estimate well. A major exception,
and one noted previously by Shoemaker (1972a), was found in estimating
the standard error of the mean test score using a PBIB design, where the
jackknife consistently and markedly overestimated SE(R). However, the
jackknife did approximate the standard error well here when a random
sampling design was used to allocate items to subtests. Looking at the
results across parameters, it was generally found that, when a PBIB
design was used, the jackknife overestimated standard errors of estimate.
This did not occur when a random sampling design (Procedure 1 in Table 1)
was used. The relative discrepancy was most marked for the mean test
score and decreased in magnitude for successively higher central moments.
In a manner similar to SE(R), the standard deviation of the jackknifed
estimates of the standard error, SD(J), decreased with increases in t when
estimating the standard error of the mean test score and decreased
generally with increases in k when estimating the standard errors of
the higher central moments for a given tk.

DISCUSSION 

The results support the conclusion that the procedure for allocating
items to subtests in multiple matrix sampling is an important
consideration. Specifically, a partially balanced incomplete block design
is preferable to a random allocation for sampling plans having the same tk.
The superiority of the PBIB is most apparent in estimating the mean test
score and becomes less apparent in estimating higher central moments.
This reinforces a conclusion made by Lord and Novick (1968) that, in
estimating the mean test score, omitting even one item has a drastic effect
on the standard error of estimate. In this investigation, a PBIB design
guaranteed that each of the K items was included in some subtest. Such
was not the case with a random allocation of items, where it was quite
possible for certain items to be omitted completely (as happened to
item 2 in Procedure 1 in Table 2). The results indicate that the Lord
and Novick conclusion is applicable to higher central moments but the
expected discrepancies are not as drastic as those expected with the
mean test score.
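
The chance of omitting items under random allocation can be put in
rough numerical terms. The following is a sketch under one assumption,
namely that each of the t subtests is an independent simple random
sample of k of the K items (Procedure 1 in Table 1), so that a given
item is missed by any one subtest with probability 1 - k/K:

```latex
\[
P(\text{a given item is omitted}) = \left(1 - \frac{k}{K}\right)^{t},
\qquad
E(\text{number of items omitted}) = K\left(1 - \frac{k}{K}\right)^{t}.
\]
```

For K = 40, k = 10 and t = 4, these expressions give (.75)^4, or about
.316, and about 12.7 items. The figures are offered only as an
illustration of the argument and are not results of this investigation.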

Of additional interest in this investigation was the use of the
jackknife in approximating standard errors of estimate in multiple
matrix sampling. The results reinforce the conclusion drawn by
Shoemaker (1972a) that the jackknife can be used for this purpose and
also shed light on a problem mentioned therein. Shoemaker noted that
the jackknife overestimated the standard error of the mean test score
when $\sigma_p^2$ = .05 and items were allocated to subtests using a PBIB
design. The results in Table 3 suggest that the inability of the jackknife
to perform well in this case was a function of the item allocation
procedure. For the jackknife to be appropriate, the pseudovalues must
behave as though they are independent and the results suggest that this
requirement is violated with a PBIB design. Regarding this violation, the
jackknife is not as robust when estimating the standard error of the mean
test score as it is in estimating standard errors of higher central moments.
The conclusion seems warranted that, when $\sigma_p^2$ departs significantly
from zero and a PBIB design is used to allocate items to subtests, the
jackknife will approximate conservatively the standard error of estimate
in multiple matrix sampling. It works quite well for all other cases.
TABLE 1

Procedures for Allocating Items to Subtests in Multiple Matrix Sampling

Item Allocation          Restrictions On tk           Restrictions On
Procedure                                             Sampling Of Items

1. Random Sampling       None                         Without replacement
                                                      within each subtest

                                                      With replacement
                                                      among subtests

2. Partially Balanced    tk < K                       Without replacement
   Incomplete Block                                   within each subtest
   Design (not all
   items tested)                                      Without replacement
                                                      among subtests

3. Partially Balanced    tk ≥ K                       Without replacement
   Incomplete Block      tk = rK (r integer)          within each subtest
   Design (all items
   tested)                                            Each of the K items
                                                      appears with equal
                                                      frequency (r) among
                                                      subtests

4. Balanced Incomplete   tk ≥ K                       Without replacement
   Block Design          tk = rK (r integer)          within each subtest
                         tk = K(K - 1)$\lambda$/(k - 1)
                         ($\lambda$ integer)               Each of the K(K - 1)/2
                                                      item pairings appears
                                                      with equal frequency
                                                      ($\lambda$) among subtests

TABLE 2

Examples of Subtests Resulting From the Four Item Allocation
Procedures Described in Table 1 Using k = 3 and K = 7

Subtest
Number     Procedure 1    Procedure 2    Procedure 3    Procedure 4

   1         1 3 5          1 2 3          1 2 3          1 2 4
   2         3 4 5          4 5 6          4 5 6          2 3 5
   3         1 3 5                         7 1 2          3 4 6
   4         1 4 7                         3 4 5          4 5 7
   5         4 5 6                         6 7 1          5 6 1
   6         3 4 6                         2 3 4          6 7 2
   7         3 6 7                         5 6 7          7 1 3